CGATcore IOTools Module¶
iotools.py - Tools for I/O operations¶
This module contains utility functions for reading from and writing to files. These include methods for

- inspecting files, such as :func:`get_first_line`, :func:`get_last_line` and :func:`is_empty`,
- working with filenames, such as :func:`which`, :func:`snip` and :func:`check_presence_of_files`,
- manipulating files, such as :func:`open_file`, :func:`zap_file`, :func:`clone_file` and :func:`touch_file`,
- converting values for input/output, such as :func:`val2str`, :func:`str2val`, :func:`pretty_percent`, :func:`human2bytes` and :func:`convert_dictionary_values`,
- iterating over file contents, such as :func:`iterate` and :func:`iterator_split`,
- creating lists/dictionaries from files, such as :func:`read_map` and :func:`read_list`, and
- working with file collections (see :class:`FilePool`).
API¶
FilePool
¶
manage a pool of output files.
This class will keep a large number of files open. To see if you can handle this, check the limit within the shell::

    ulimit -n

The number of currently open and maximum open files in the system::

    cat /proc/sys/fs/file-nr

Changing these limits might not be easy without root privileges.
The maximum number of files opened is given by :attr:`maxopen`.
This class is inefficient if the number of files is larger than :attr:`maxopen` and calls to `write` do not group keys together.
To use this class, create a FilePool and write to it as if it were a single file, specifying a section for each write::

    pool = FilePool("%s.tsv")
    for value in range(100):
        for section in ("file1", "file2", "file3"):
            pool.write(section, str(value) + ",")
This will create three files called `file1.tsv`, `file2.tsv` and `file3.tsv`, each containing the numbers from 0 to 99.
The FilePool otherwise acts as a dictionary, providing access to the number of times an item has been written to each file::

    print(pool["file1"])
    print(pool.items())
Parameters¶
string
output pattern to use. Should contain a "%s". If set to None, the pattern "%s" will be used.
header : string
optional header to write when writing to a file the first time.
force : bool
overwrite existing files. All files matching the pattern will be deleted.
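The usage pattern described above (one output file per section, plus a per-section write count) can be sketched with the standard library alone. `MiniFilePool` is a hypothetical stand-in written for illustration, not the cgatcore class: it ignores `maxopen`, `header` and `force`, and simply keeps every file open until `close()` is called.

```python
import os
import tempfile
from collections import Counter


class MiniFilePool:
    """Hypothetical, simplified stand-in for FilePool (illustration only)."""

    def __init__(self, output_pattern="%s"):
        self.pattern = output_pattern
        self.files = {}          # section -> open file handle
        self.counts = Counter()  # section -> number of writes

    def write(self, section, line):
        # Open the file for a section lazily, on its first write.
        if section not in self.files:
            self.files[section] = open(self.pattern % section, "w")
        self.files[section].write(line)
        self.counts[section] += 1

    def __getitem__(self, section):
        return self.counts[section]

    def close(self):
        for handle in self.files.values():
            handle.close()
        self.files.clear()


tmpdir = tempfile.mkdtemp()
pool = MiniFilePool(os.path.join(tmpdir, "%s.tsv"))
for value in range(100):
    for section in ("file1", "file2", "file3"):
        pool.write(section, str(value) + ",")
pool.close()
```

A real pool additionally has to close and reopen files once more than `maxopen` sections are in use, which is why grouping writes by key matters for performance.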
Source code in cgatcore/iotools.py
__del__()
¶
close()
¶
deleteFiles(min_size=0)
¶
delete all files below a minimum size of `min_size` bytes.
Source code in cgatcore/iotools.py
getFilename(identifier)
¶
open_file(filename, mode='w')
¶
open file.
If file is in a new directory, create directories.
Source code in cgatcore/iotools.py
setHeader(header)
¶
write(identifier, line)
¶
write `line` to the file specified by `identifier`.
Source code in cgatcore/iotools.py
FilePoolMemory
¶
Bases: FilePool
manage a pool of output files in memory.
The usage is the same as :class:`FilePool`, but the data is cached in memory before writing to disk.
Source code in cgatcore/iotools.py
__del__()
¶
close()
¶
close all open files and write the data to disk.
Source code in cgatcore/iotools.py
nested_dict
¶
Bases: defaultdict
Auto-vivifying nested dictionaries.
For example::
    nd = nested_dict()
    nd["mouse"]["chr1"]["+"] = 311
Source code in cgatcore/iotools.py
iterflattened()
¶
iterate through values with nested keys flattened into a tuple
Source code in cgatcore/iotools.py
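Auto-vivification and key flattening can be sketched on top of `collections.defaultdict`. The class name `NestedDict` and the exact shape of the yielded items (a key tuple paired with the leaf value) are assumptions made for this illustration; the cgatcore implementation may differ in detail.

```python
from collections import defaultdict


class NestedDict(defaultdict):
    """Hypothetical auto-vivifying dictionary (illustration only)."""

    def __init__(self):
        # Missing keys create another NestedDict, so chained assignment works.
        super().__init__(NestedDict)

    def iterflattened(self):
        """Yield (key-path tuple, value) pairs for every leaf value."""
        for key, value in self.items():
            if isinstance(value, NestedDict):
                for path, leaf in value.iterflattened():
                    yield (key,) + path, leaf
            else:
                yield (key,), value


nd = NestedDict()
nd["mouse"]["chr1"]["+"] = 311
flat = list(nd.iterflattened())
```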
bytes2human(n, format='%(value).1f%(symbol)s', symbols='customary')
¶
Convert n bytes into a human readable string based on format. symbols can be either "customary", "customary_ext", "iec" or "iec_ext", see: http://goo.gl/kTQMs

    >>> bytes2human(0)
    '0.0B'
    >>> bytes2human(0.9)
    '0.0B'
    >>> bytes2human(1)
    '1.0B'
    >>> bytes2human(1.9)
    '1.0B'
    >>> bytes2human(1024)
    '1.0K'
    >>> bytes2human(1048576)
    '1.0M'
    >>> bytes2human(1099511627776127398123789121)
    '909.5Y'

    >>> bytes2human(9856, symbols="customary")
    '9.6K'
    >>> bytes2human(9856, symbols="customary_ext")
    '9.6kilo'
    >>> bytes2human(9856, symbols="iec")
    '9.6Ki'
    >>> bytes2human(9856, symbols="iec_ext")
    '9.6kibi'

    >>> bytes2human(10000, "%(value).1f %(symbol)s/sec")
    '9.8 K/sec'

    >>> # precision can be adjusted by playing with %f operator
    >>> bytes2human(10000, format="%(value).5f %(symbol)s")
    '9.76562 K'
Author: Giampaolo Rodola'
Source code in cgatcore/iotools.py
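The conversion shown in the examples above can be approximated as follows. This is a reduced sketch that supports only the single-letter "customary" symbols and assumes units are powers of 1024; `bytes2human_sketch` and `SYMBOLS` are names invented here, not part of cgatcore.

```python
SYMBOLS = ("B", "K", "M", "G", "T", "P", "E", "Z", "Y")


def bytes2human_sketch(n, fmt="%(value).1f%(symbol)s"):
    """Format n bytes using powers of 1024 and single-letter symbols."""
    n = int(n)
    if n < 0:
        raise ValueError("n < 0")
    # Try the largest unit first and stop at the first one that fits.
    for exponent in range(len(SYMBOLS) - 1, 0, -1):
        unit = 1 << (exponent * 10)
        if n >= unit:
            return fmt % {"value": n / unit, "symbol": SYMBOLS[exponent]}
    return fmt % {"value": n, "symbol": SYMBOLS[0]}
```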
check_presence_of_files(filenames)
¶
check for the presence/absence of files
Parameters¶
filenames : list Filenames to check for presence.
Returns¶
missing : list List of missing filenames
Source code in cgatcore/iotools.py
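The contract described here (a list of filenames in, the missing ones out) amounts to a filter over `os.path.exists`. The function name `missing_files` is hypothetical and the snippet is only a sketch of the idea, not the library code.

```python
import os
import tempfile


def missing_files(filenames):
    """Return the subset of filenames that do not exist on disk."""
    return [name for name in filenames if not os.path.exists(name)]


with tempfile.TemporaryDirectory() as tmpdir:
    present = os.path.join(tmpdir, "present.txt")
    absent = os.path.join(tmpdir, "absent.txt")
    open(present, "w").close()  # create an empty file
    result = missing_files([present, absent])
```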
clone_file(infile, outfile)
¶
create a clone of `infile` named `outfile` by creating a soft-link.
Source code in cgatcore/iotools.py
convert_dictionary_values(d, map={})
¶
convert string values in a dictionary to numeric types.
Arguments
d : dict
The dictionary to convert
map : dict
If map contains 'default', a default conversion is enforced.
For example, to force int for every column but column `id`, supply `map = {'default' : "int", "id" : "str" }`
Source code in cgatcore/iotools.py
flatten(nested_list, ltypes=(list, tuple))
¶
flatten a nested list.
This method works with any list-like container such as tuples.
Arguments¶
nested_list : list
A nested list.
ltypes : list
A list of valid container types.
Returns¶
list : list A flattened list.
Source code in cgatcore/iotools.py
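A recursive version of this idea is shown below as a minimal sketch. `flatten_sketch` is a hypothetical name, and the cgatcore function may well be implemented iteratively; only the input/output behaviour described above is illustrated.

```python
def flatten_sketch(nested, ltypes=(list, tuple)):
    """Return a flat list with all nested containers expanded in order."""
    flat = []
    for item in nested:
        if isinstance(item, ltypes):
            # Recurse into any container type listed in ltypes.
            flat.extend(flatten_sketch(item, ltypes))
        else:
            flat.append(item)
    return flat


result = flatten_sketch([1, [2, (3, 4)], [[5]]])
```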
force_str(iterator, encoding='ascii')
¶
get_first_line(filename, nlines=1)
¶
return the first line of a file.
Arguments¶
filename : string
The name of the file to be opened.
nlines : int
Number of lines to return.
Returns¶
string The first line(s) of the file.
Source code in cgatcore/iotools.py
get_last_line(filename, nlines=1, read_size=1024, encoding='utf-8')
¶
return the last line of a file.
This method reads backwards from the end of the file in blocks of `read_size` bytes until the beginning of the last line is reached.
Arguments¶
filename : string
Name of the file to be opened.
nlines : int
Number of lines to return.
read_size : int
Number of bytes to read.
Returns¶
string The last line(s) of the file.
Source code in cgatcore/iotools.py
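The block-wise backward read can be sketched as follows. `last_lines` is a hypothetical helper; whether the real function keeps trailing newlines in its return value is not stated here, so that detail is an assumption of the sketch.

```python
import os
import tempfile


def last_lines(filename, nlines=1, read_size=1024, encoding="utf-8"):
    """Return the last nlines lines by reading backwards in blocks."""
    with open(filename, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        position = handle.tell()
        data = b""
        # Read blocks from the end until more than nlines newlines are
        # buffered, which guarantees the first wanted line is complete.
        while position > 0 and data.count(b"\n") <= nlines:
            step = min(read_size, position)
            position -= step
            handle.seek(position)
            data = handle.read(step) + data
    # Split on bytes first so a partially read character at the start of
    # the buffer is never decoded.
    lines = data.splitlines(keepends=True)
    return b"".join(lines[-nlines:]).decode(encoding)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "log.txt")
    with open(path, "wb") as outf:
        outf.write(b"first\nsecond\nthird\n")
    tail_one = last_lines(path, read_size=4)
    tail_two = last_lines(path, nlines=2, read_size=4)
```

Reading from the end avoids scanning the whole file, which matters for large log files where only the final status line is of interest.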
get_num_lines(filename, ignore_comments=True)
¶
count number of lines in filename.
Arguments¶
filename : string
Name of the file to be opened.
ignore_comments : bool
If true, ignore lines starting with `#`.
Returns¶
int The number of line(s) in the file.
Source code in cgatcore/iotools.py
human2bytes(s)
¶
Attempts to guess the string format based on default symbols set and return the corresponding bytes as an integer. When unable to recognize the format ValueError is raised.
    >>> human2bytes('0 B')
    0
    >>> human2bytes('1 K')
    1024
    >>> human2bytes('1 M')
    1048576
    >>> human2bytes('1 Gi')
    1073741824
    >>> human2bytes('1 tera')
    1099511627776

    >>> human2bytes('0.5kilo')
    512
    >>> human2bytes('0.1 byte')
    0
    >>> human2bytes('1 k')  # k is an alias for K
    1024
    >>> human2bytes('12 foo')
    Traceback (most recent call last):
        ...
    ValueError: can't interpret '12 foo'
Author: Giampaolo Rodola'
Source code in cgatcore/iotools.py
invert_dictionary(dict, make_unique=False)
¶
returns an inverted dictionary with keys and values swapped.
Source code in cgatcore/iotools.py
is_complete(filename)
¶
return True if file exists and is complete.
A file is complete if its last line contains `job finished`.
Source code in cgatcore/iotools.py
is_empty(filename)
¶
is_nested(container)
¶
return true if container is a nested data structure.
A nested data structure is a dict of dicts or a list of lists, but not a dict of lists or a list of dicts.
Source code in cgatcore/iotools.py
iterate(infile)
¶
iterate over `infile` and return a :py:class:`collections.namedtuple` according to a header in the first row.
Lines starting with `#` are skipped.
Source code in cgatcore/iotools.py
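The header-driven iteration can be sketched with `collections.namedtuple`. The tab delimiter and the names `iterate_sketch` and `Record` are assumptions of this example; the text above does not state which separator the function uses.

```python
import collections
import io


def iterate_sketch(infile):
    """Yield one namedtuple per data row, named after the header row."""
    record_type = None
    for line in infile:
        if line.startswith("#"):
            continue  # comment lines are skipped
        fields = line.rstrip("\n").split("\t")
        if record_type is None:
            # The first non-comment line is taken as the header; its
            # column names must be valid Python identifiers.
            record_type = collections.namedtuple("Record", fields)
            continue
        yield record_type(*fields)


rows = list(iterate_sketch(io.StringIO("# comment\ngene\tlength\nabc\t120\n")))
```

Fields are then accessible by name, for example `rows[0].gene`, and all values remain strings unless converted explicitly.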
iterate_tabular(infile, sep='\t')
¶
iterate over file `infile`, skipping lines starting with `#`.
Within a line, records are separated by `sep`.
Yields¶
tuple Records within a line
Source code in cgatcore/iotools.py
iterator_split(infile, regex)
¶
Return an iterator of file chunks based on a known logical start point `regex` that splits the file into intuitive chunks. This assumes the file is structured in some fashion. For an arbitrary number of bytes use `file.read(bytes)`. If a header is present it is returned as the first file chunk.
infile must be either an open file handle or an iterable.
Source code in cgatcore/iotools.py
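Chunking by a start-of-record pattern can be sketched as below. `split_chunks` is a hypothetical name, and yielding each chunk as a list of lines is a choice made for this example; the library function may return chunks in another form.

```python
import io
import re


def split_chunks(infile, regex):
    """Yield lists of lines, starting a new chunk at each matching line."""
    pattern = re.compile(regex)
    chunk = []
    for line in infile:
        # A matching line closes the current chunk and opens the next one.
        if pattern.search(line) and chunk:
            yield chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield chunk


text = "header\n>seq1\nACGT\n>seq2\nTTGA\nCCAA\n"
chunks = list(split_chunks(io.StringIO(text), r"^>"))
```

Lines before the first match form their own chunk, which corresponds to the header being returned first.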
nested_iter(nested)
¶
iterate over the contents of a nested data structure.
The nesting can be done both as lists or as dictionaries.
Arguments¶
nested : dict A nested dictionary
Yields¶
pair: tuple A container/key/value triple
Source code in cgatcore/iotools.py
open_file(filename, mode='r', create_dir=False, encoding='utf-8')
¶
open the file called `filename` with mode `mode`.
gzip-compressed files are recognized by the suffix `.gz` and opened transparently.
Note that there are differences in the file-like objects returned, for example in the ability to seek.
Arguments¶
filename : string
mode : string
File opening mode
create_dir : bool
If True, the directory containing filename will be created if it does not exist.
Returns¶
File or file-like object in case of gzip compressed files.
Source code in cgatcore/iotools.py
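Suffix-based transparent opening can be sketched with the standard `gzip` module. `open_any` is a hypothetical helper that handles only the plain text modes "r", "w" and "a"; it is meant to illustrate the behaviour described above, not to reproduce the cgatcore code.

```python
import gzip
import os
import tempfile


def open_any(filename, mode="r", create_dir=False, encoding="utf-8"):
    """Open plain or .gz files in text mode, optionally creating the directory."""
    if create_dir:
        dirname = os.path.dirname(filename)
        if dirname:
            os.makedirs(dirname, exist_ok=True)
    if filename.endswith(".gz"):
        # gzip.open needs an explicit "t" to return str rather than bytes.
        return gzip.open(filename, mode + "t", encoding=encoding)
    return open(filename, mode, encoding=encoding)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "new_dir", "data.tsv.gz")
    with open_any(path, "w", create_dir=True) as outf:
        outf.write("a\t1\n")
    with open_any(path) as inf:
        content = inf.read()
```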
pickle(file_name, obj)
¶
pretty_percent(numerator, denominator, format='%5.2f', na='na')
¶
output a percent value or "na" if not defined
Source code in cgatcore/iotools.py
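A minimal sketch of the idea follows, assuming that "not defined" means a zero denominator and that the ratio is scaled by 100 before formatting. Both points, and the name `percent_or_na`, are assumptions of this example.

```python
def percent_or_na(numerator, denominator, fmt="%5.2f", na="na"):
    """Format numerator/denominator as a percentage, or return na."""
    try:
        return fmt % (100.0 * numerator / denominator)
    except ZeroDivisionError:
        return na


quarter = percent_or_na(1, 4)
undefined = percent_or_na(5, 0)
```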
pretty_string(val)
¶
readMultiMap(infile, columns=(0, 1), map_functions=(str, str), both_directions=False, has_header=False, dtype=dict)
¶
read a map (pairs of values) from infile.
In contrast to :func:`read_map`, this method permits multiple entries for the same key.
Arguments¶
infile : File
File object to read from
columns : tuple
Columns (A, B) to take from the file to create the mapping from A to B.
map_functions : tuple
Functions to convert the values in the rows to the desired object types such as int or float.
both_directions : bool
If true, both mapping directions are returned in a tuple, i.e., A->B and B->A.
has_header : bool
If true, ignore first line with header.
dtype : function
datatype to use for the dictionaries.
Returns¶
map : dict
A dictionary containing the mapping. If `both_directions` is true, two dictionaries will be returned.
Source code in cgatcore/iotools.py
read_list(infile, column=0, map_function=str, map_category={}, with_title=False)
¶
read a list of values from infile.
Arguments¶
infile : File
File object to read from
column : int
Column to take from the file.
map_function : function
Function to convert the values in the rows to the desired object types such as int or float.
map_category : dict
When given, automatically transform/map the values given this dictionary.
with_title : bool
If true, first line of file is title and will be ignored.
Returns¶
list : list A list with the values.
Source code in cgatcore/iotools.py
read_map(infile, columns=(0, 1), map_functions=(str, str), both_directions=False, has_header=True, dtype=dict)
¶
read a map (key, value pairs) from infile.
If there are multiple entries for the same key, only the last entry will be recorded.
Arguments¶
infile : File
File object to read from
columns : tuple
Columns (A, B) to take from the file to create the mapping from A to B.
map_functions : tuple
Functions to convert the values in the rows to the desired object types such as int or float.
both_directions : bool
If true, both mapping directions are returned.
has_header : bool
If true, ignore first line with header.
dtype : function
datatype to use for the dictionaries.
Returns¶
map : dict
A dictionary containing the mapping. If `both_directions` is true, two dictionaries will be returned.
Source code in cgatcore/iotools.py
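The "last entry wins" behaviour described above can be illustrated with a reduced sketch. It assumes tab-separated input and omits `both_directions` and `dtype`; `read_map_sketch` is a hypothetical name.

```python
import io


def read_map_sketch(infile, columns=(0, 1), map_functions=(str, str),
                    has_header=True):
    """Build a dict from two tab-separated columns; later keys overwrite."""
    key_col, value_col = columns
    key_func, value_func = map_functions
    mapping = {}
    header_pending = has_header
    for line in infile:
        if header_pending:
            header_pending = False  # skip the header line once
            continue
        fields = line.rstrip("\n").split("\t")
        mapping[key_func(fields[key_col])] = value_func(fields[value_col])
    return mapping


table = "gene\tlength\nabc\t120\nxyz\t80\nabc\t125\n"
lengths = read_map_sketch(io.StringIO(table), map_functions=(str, int))
```

Here the key `abc` occurs twice, so only its second value is kept; a multi-map reader would instead collect both values.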
snip(filename, extension=None, alt_extension=None, strip_path=False)
¶
return the prefix of `filename`, that is, the part without the extension.
If `extension` is given, make sure that filename has the extension (or `alt_extension`). Both `extension` and `alt_extension` can be lists of extensions.
If `strip_path` is set to true, the path is stripped from the file name.
Source code in cgatcore/iotools.py
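A simplified sketch of this behaviour is given below. It accepts a single extension only, ignores `alt_extension`, and raises `ValueError` on a mismatch; the error type and the name `snip_sketch` are assumptions made for the example.

```python
import os


def snip_sketch(filename, extension=None, strip_path=False):
    """Return filename without its extension (hypothetical simplification)."""
    if extension is None:
        # No extension given: drop whatever follows the last dot.
        prefix = os.path.splitext(filename)[0]
    elif filename.endswith(extension):
        prefix = filename[:len(filename) - len(extension)]
    else:
        raise ValueError("'%s' does not end in '%s'" % (filename, extension))
    if strip_path:
        prefix = os.path.basename(prefix)
    return prefix


default_prefix = snip_sketch("/data/reads.fastq.gz")
explicit_prefix = snip_sketch("/data/reads.fastq.gz", ".fastq.gz", strip_path=True)
```

Passing the extension explicitly is what makes compound suffixes such as `.fastq.gz` come off in one step.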
str2val(val, na='na', list_detection=False)
¶
guess type (int, float) of value.
If `val` is neither int nor float, the value itself is returned.
Source code in cgatcore/iotools.py
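Type guessing of this kind is usually a chain of conversion attempts. The sketch below covers only the int/float/unchanged cases for string input and leaves out the `na` and `list_detection` parameters, whose exact semantics are not described here; `guess_type` is a hypothetical name.

```python
def guess_type(val):
    """Convert a string to int or float if possible, else return it unchanged."""
    for convert in (int, float):
        try:
            return convert(val)
        except ValueError:
            continue  # try the next, more permissive conversion
    return val


as_int = guess_type("12")
as_float = guess_type("1.5")
as_text = guess_type("chr1")
```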
text_to_dict(filename, key=None, sep='\t')
¶
make a dictionary from a text file keyed on the specified column.
Source code in cgatcore/iotools.py
touch_file(filename, mode=438, times=None, dir_fd=None, ref=None, **kwargs)
¶
update/create a sentinel file.
modified from: https://stackoverflow.com/questions/1158076/implement-touch-using-python
Compressed files (ending in .gz) are created as empty 'gzip' files, i.e., with a header.
Source code in cgatcore/iotools.py
unpickle(file_name)
¶
val2list(val)
¶
val2str(val, format='%5.2f', na='na')
¶
return a formatted value.
If value does not fit format string, return "na"
Source code in cgatcore/iotools.py
which(program)
¶
check if `program` is in PATH and is executable.
Returns¶
string The full path to the program. Returns None if not found.
Source code in cgatcore/iotools.py
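The Python standard library offers `shutil.which` with a similar contract (full path, or `None` when nothing executable is found), so the behaviour can be tried out without cgatcore. The wrapper name `find_program` and the program name used below are made up for the example.

```python
import shutil


def find_program(program):
    """Return the full path of an executable on PATH, or None."""
    return shutil.which(program)


# A deliberately implausible program name, expected to be absent.
missing = find_program("no-such-program-cgatcore-example")
```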
writeMatrix(outfile, matrix, row_headers, col_headers, row_header='')
¶
write a numpy matrix to outfile.
row_header gives the title of the rows
Source code in cgatcore/iotools.py
write_lines(outfile, lines, header=False)
¶
expects [[[line1-field1],[line1-field2 ] ],... ]
Source code in cgatcore/iotools.py
write_matrix(outfile, matrix, row_headers, col_headers, row_header='')
¶
write a numpy matrix to outfile. row_header gives the title of the rows
Source code in cgatcore/iotools.py
write_table(outfile, table, columns=None, fillvalue='')
¶
write a table to outfile.
If table is a dictionary, output columnwise. If columns is a list, only output columns in columns in the specified order.
.. note:: Deprecated; use pandas dataframes instead.
Source code in cgatcore/iotools.py
zap_file(filename)
¶
replace filename with empty file.
File attributes such as access times are preserved.
If the file is a link, the link will be broken and replaced with an empty file having the same attributes as the file linked to.
Returns¶
stat_object
A stat object of the file cleaned.
link_destination : string
If the file was a link, the file being linked to.
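Emptying a file while keeping its attributes can be sketched for the simple case of a regular file. `zap_sketch` is a hypothetical helper that restores only the permission bits and the access/modification times; it does not handle symbolic links, which the function described above treats specially.

```python
import os
import stat
import tempfile


def zap_sketch(filename):
    """Empty a regular file, then restore its mode and timestamps."""
    original = os.stat(filename)
    with open(filename, "w"):
        pass  # opening in "w" mode truncates the file
    os.chmod(filename, stat.S_IMODE(original.st_mode))
    os.utime(filename, ns=(original.st_atime_ns, original.st_mtime_ns))
    return original


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "big_output.txt")
    with open(path, "w") as outf:
        outf.write("x" * 1000)
    os.utime(path, (1000000000, 1000000000))  # fixed, recognizable timestamps
    before = zap_sketch(path)
    after = os.stat(path)
```

Keeping the modification time unchanged is what lets timestamp-based workflow tools continue to regard the emptied file as up to date.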