reference/mbstring/functions/mb-detect-encoding.xml
34f90a65914c900173f9a42331acc45bc53d8eee
34f90a65914c900173f9a42331acc45bc53d8eee
...
...
@@ -9,14 +9,37 @@
9
9
<refsect1 role="description">
10
10
&reftitle.description;
11
11
<methodsynopsis>
12
-
<type>string</type><methodname>mb_detect_encoding</methodname>
13
-
<methodparam><type>string</type><parameter>str</parameter></methodparam>
14
-
<methodparam choice="opt"><type>mixed</type><parameter>encoding_list</parameter><initializer>mb_detect_order()</initializer></methodparam>
15
-
<methodparam choice="opt"><type>bool</type><parameter>strict</parameter><initializer>false</initializer></methodparam>
12
+
<type class="union"><type>string</type><type>false</type></type><methodname>mb_detect_encoding</methodname>
13
+
<methodparam><type>string</type><parameter>string</parameter></methodparam>
14
+
<methodparam choice="opt"><type class="union"><type>array</type><type>string</type><type>null</type></type><parameter>encodings</parameter><initializer>&null;</initializer></methodparam>
15
+
<methodparam choice="opt"><type>bool</type><parameter>strict</parameter><initializer>&false;</initializer></methodparam>
16
16
</methodsynopsis>
17
17
<para>
18
-
Detects character encoding in <type>string</type> <parameter>str</parameter>.
18
+
Detects the most likely character encoding for <type>string</type> <parameter>string</parameter>
19
+
from an ordered list of candidates.
19
20
</para>
21
+
<para>
22
+
Automatic detection of the intended character encoding can never be entirely reliable;
23
+
without some additional information, it is similar to decoding an encrypted string
24
+
without the key. It is always preferable to use an indication of character encoding
25
+
stored or transmitted with the data, such as a "Content-Type" HTTP header.
26
+
</para>
27
+
<para>
28
+
This function is most useful with multibyte encodings, where not all sequences of
29
+
bytes form a valid string. If the input string contains such a sequence, that
30
+
encoding will be rejected, and the next encoding checked.
31
+
</para>
32
+
33
+
<warning>
34
+
<title>The result is not accurate</title>
35
+
<para>
36
+
The name of this function is misleading: it performs "guessing" rather than "detection".
37
+
</para>
38
+
<para>
39
+
The guesses are far from accurate, so this function cannot be relied upon to
40
+
detect the correct character encoding.
41
+
</para>
42
+
</warning>
20
43
</refsect1>
21
44
22
45
<refsect1 role="parameters">
...
...
@@ -24,24 +47,25 @@
24
47
<para>
25
48
<variablelist>
26
49
<varlistentry>
27
-
<term><parameter>str</parameter></term>
50
+
<term><parameter>string</parameter></term>
28
51
<listitem>
29
52
<para>
30
-
The <type>string</type> being detected.
53
+
The <type>string</type> being inspected.
31
54
</para>
32
55
</listitem>
33
56
</varlistentry>
34
57
<varlistentry>
35
-
<term><parameter>encoding_list</parameter></term>
58
+
<term><parameter>encodings</parameter></term>
36
59
<listitem>
37
60
<para>
38
-
<parameter>encoding_list</parameter> is list of character
39
-
encoding. Encoding order may be specified by array or comma
40
-
separated list string.
61
+
A list of character encodings to try, in order. The list may be specified as
62
+
an array of strings, or as a single comma-separated string.
41
63
</para>
42
64
<para>
43
-
If <parameter>encoding_list</parameter> is omitted,
44
-
detect_order is used.
65
+
If <parameter>encodings</parameter> is omitted or &null;,
66
+
the current detect_order (set with the <link linkend="ini.mbstring.detect-order">
67
+
mbstring.detect_order</link> configuration option, or the <function>mb_detect_order</function>
68
+
function) will be used.
45
69
</para>
46
70
</listitem>
47
71
</varlistentry>
...
...
@@ -49,9 +73,16 @@
49
73
<term><parameter>strict</parameter></term>
50
74
<listitem>
51
75
<para>
52
-
<parameter>strict</parameter> specifies whether to use
53
-
the strict encoding detection or not.
54
-
Default is &false;.
76
+
Controls the behaviour when <parameter>string</parameter>
77
+
is not valid in any of the listed <parameter>encodings</parameter>.
78
+
If <parameter>strict</parameter> is set to &false;, the closest matching
79
+
encoding will be returned; if <parameter>strict</parameter> is set to &true;,
80
+
&false; will be returned.
81
+
</para>
82
+
<para>
83
+
The default value for <parameter>strict</parameter> can be set
84
+
with the <link linkend="ini.mbstring.strict-detection">
85
+
mbstring.strict_detection</link> configuration option.
55
86
</para>
56
87
</listitem>
57
88
</varlistentry>
...
...
@@ -62,11 +93,37 @@
62
93
<refsect1 role="returnvalues">
63
94
&reftitle.returnvalues;
64
95
<para>
65
-
The detected character encoding or &false; if the encoding cannot be
66
-
detected from the given string.
96
+
The detected character encoding, or &false; if the string is not valid
97
+
in any of the listed encodings.
67
98
</para>
68
99
</refsect1>
69
100
101
+
<refsect1 role="changelog">
102
+
&reftitle.changelog;
103
+
<informaltable>
104
+
<tgroup cols="2">
105
+
<thead>
106
+
<row>
107
+
<entry>&Version;</entry>
108
+
<entry>&Description;</entry>
109
+
</row>
110
+
</thead>
111
+
<tbody>
112
+
<row>
113
+
<entry>8.2.0</entry>
114
+
<entry>
115
+
<function>mb_detect_encoding</function> will no longer return
116
+
the following non-text encodings:
117
+
<literal>"Base64"</literal>, <literal>"QPrint"</literal>,
118
+
<literal>"UUencode"</literal>, <literal>"HTML entities"</literal>,
119
+
<literal>"7 bit"</literal> and <literal>"8 bit"</literal>.
120
+
</entry>
121
+
</row>
122
+
</tbody>
123
+
</tgroup>
124
+
</informaltable>
125
+
</refsect1>
126
+
70
127
<refsect1 role="examples">
71
128
&reftitle.examples;
72
129
<para>
...
...
@@ -75,23 +132,112 @@
75
132
<programlisting role="php">
76
133
<![CDATA[
77
134
<?php
78
-
/* Detect character encoding with current detect_order */
79
-
echo mb_detect_encoding($str);
80
135
81
-
/* "auto" is expanded according to mbstring.language */
82
-
echo mb_detect_encoding($str, "auto");
136
+
$str = "\x95\xB6\x8E\x9A\x83\x52\x81\x5B\x83\x68";
83
137
84
-
/* Specify encoding_list character encoding by comma separated list */
85
-
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");
138
+
// Detect character encoding with current detect_order
139
+
var_dump(mb_detect_encoding($str));
86
140
87
-
/* Use array to specify encoding_list */
88
-
$ary[] = "ASCII";
89
-
$ary[] = "JIS";
90
-
$ary[] = "EUC-JP";
91
-
echo mb_detect_encoding($str, $ary);
141
+
// "auto" is expanded according to mbstring.language
142
+
var_dump(mb_detect_encoding($str, "auto"));
143
+
144
+
// Specify the "encodings" parameter as a comma-separated list
145
+
var_dump(mb_detect_encoding($str, "JIS, eucjp-win, sjis-win"));
146
+
147
+
// Use an array to specify the "encodings" parameter
148
+
$encodings = [
149
+
"ASCII",
150
+
"JIS",
151
+
"EUC-JP"
152
+
];
153
+
var_dump(mb_detect_encoding($str, $encodings));
92
154
?>
93
155
]]>
94
156
</programlisting>
157
+
&example.outputs;
158
+
<screen>
159
+
<![CDATA[
160
+
string(5) "ASCII"
161
+
string(5) "ASCII"
162
+
string(8) "SJIS-win"
163
+
string(5) "ASCII"
164
+
]]>
165
+
</screen>
166
+
</example>
167
+
</para>
168
+
<para>
169
+
<example>
170
+
<title>Effect of the <parameter>strict</parameter> parameter</title>
171
+
<programlisting role="php">
172
+
<![CDATA[
173
+
<?php
174
+
// 'áéóú' encoded in ISO-8859-1
175
+
$str = "\xE1\xE9\xF3\xFA";
176
+
177
+
// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
178
+
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
179
+
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));
180
+
181
+
// If a valid encoding is found, the strict parameter does not change the result
182
+
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
183
+
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
184
+
?>
185
+
]]>
186
+
</programlisting>
187
+
&example.outputs;
188
+
<screen>
189
+
<![CDATA[
190
+
string(5) "UTF-8"
191
+
bool(false)
192
+
string(10) "ISO-8859-1"
193
+
string(10) "ISO-8859-1"
194
+
]]>
195
+
</screen>
196
+
</example>
197
+
</para>
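<para>
Because an encoding indication transmitted with the data is preferable to a guess,
one possible approach is to trust a declared charset first and only fall back to
strict detection when none is usable. The following is a minimal sketch, assuming
PHP 8.0 or later; the helpers <literal>declared_charset()</literal> and
<literal>guess_encoding()</literal> are hypothetical names introduced for
illustration and are not part of mbstring.
</para>
<para>
<example>
<title>Preferring a declared charset over detection (illustrative sketch)</title>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, or return null if the header does not declare one.
function declared_charset(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s";]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: use the declared charset when the body is valid in it,
// otherwise fall back to strict detection over the given candidates.
function guess_encoding(string $body, string $contentType, array $candidates): string|false
{
    $charset = declared_charset($contentType);
    if ($charset !== null) {
        try {
            if (mb_check_encoding($body, $charset)) {
                return $charset;
            }
        } catch (ValueError $e) {
            // The declared name is not an encoding known to mbstring
        }
    }
    return mb_detect_encoding($body, $candidates, true);
}

// 'áéóú' encoded in ISO-8859-1
$body = "\xE1\xE9\xF3\xFA";

var_dump(guess_encoding($body, 'text/plain; charset=ISO-8859-1', ['ASCII', 'UTF-8']));
var_dump(guess_encoding($body, 'text/plain', ['ASCII', 'UTF-8']));
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
string(10) "ISO-8859-1"
bool(false)
]]>
</screen>
</example>
</para>
<para>
Passing &true; for <parameter>strict</parameter> in the fallback means that a
string which is invalid in every candidate yields &false; instead of a
closest-match guess, so the caller can decide how to handle undecodable input.
</para>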
198
+
<para>
199
+
In some cases, the same sequence of bytes may form a valid string in multiple
200
+
character encodings, and it is impossible to know which interpretation was
201
+
intended. For instance, among many others, the byte sequence "\xC4\xA2" could be:
202
+
</para>
203
+
<para>
204
+
<simplelist>
205
+
<member>
206
+
"Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN)
207
+
encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
208
+
</member>
209
+
<member>
210
+
"ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER
211
+
DJE) encoded in ISO-8859-5
212
+
</member>
213
+
<member>
214
+
"Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8
215
+
</member>
216
+
</simplelist>
217
+
</para>
218
+
<para>
219
+
<example>
220
+
<title>Effect of order when multiple encodings match</title>
221
+
<programlisting role="php">
222
+
<![CDATA[
223
+
<?php
224
+
$str = "\xC4\xA2";
225
+
226
+
// The string is valid in all three encodings, so the first one listed will be returned
227
+
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
228
+
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
229
+
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
230
+
?>
231
+
]]>
232
+
</programlisting>
233
+
&example.outputs;
234
+
<screen>
235
+
<![CDATA[
236
+
string(5) "UTF-8"
237
+
string(10) "ISO-8859-1"
238
+
string(10) "ISO-8859-5"
239
+
]]>
240
+
</screen>
95
241
</example>
96
242
</para>
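<para>
Since only the first matching encoding is reported, it can be helpful to see
every candidate in which a string is valid before relying on the result. The
following sketch uses <function>mb_check_encoding</function> for that purpose;
the candidate list is an arbitrary choice made for illustration.
</para>
<para>
<example>
<title>Listing every candidate encoding in which a string is valid (illustrative sketch)</title>
<programlisting role="php">
<![CDATA[
<?php
$str = "\xC4\xA2";
$candidates = ['ASCII', 'UTF-8', 'ISO-8859-1', 'ISO-8859-5'];

// Keep every candidate in which the byte sequence forms a valid string
$valid = array_values(array_filter(
    $candidates,
    fn (string $encoding): bool => mb_check_encoding($str, $encoding)
));

echo implode(', ', $valid), "\n";
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
UTF-8, ISO-8859-1, ISO-8859-5
]]>
</screen>
</example>
</para>
<para>
When more than one encoding remains, the bytes alone do not determine which
interpretation was intended; the order of <parameter>encodings</parameter>
decides which name <function>mb_detect_encoding</function> returns.
</para>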
97
243
</refsect1>
...
...
@@ -106,7 +252,6 @@ echo mb_detect_encoding($str, $ary);
106
252
</refsect1>
107
253
108
254
</refentry>
109
-
110
255
<!-- Keep this comment at the end of the file
111
256
Local variables:
112
257
mode: sgml
113
258