Messy code issue

ReZero lol

Source: https://www.ibm.com/developerworks/cn/java/analysis-and-summary-of-common-random-code-problems/index.html?cm_mmc=dwchina-_-homepage-_-social-_-weibo#N101F9

Reason

  1. Encode

  2. Decode

  3. Lack of a font library

Analysis phenomenon

  1. Caused by encoding

On an English Windows system (Ewin), create a txt file, type “你好”, and save it. Open the file again and you will see “??”.

  • Reason:
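    Windows saves with the ANSI encoding by default. The locale of Ewin is English, which maps to code page 437 with ISO-8859-1 as the encoding (strictly, 437 is the OEM code page and the ANSI code page on English Windows is Windows-1252, a close relative of ISO-8859-1). Neither can represent Chinese characters, so each one is encoded as 3F, and 3F is “?”. “你好” is therefore saved as “3F3F”.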

  • Solution:
    No decoding method can display the right characters, because the information was already lost when the file was saved. Choose the right encoding when saving a document with double-byte characters: GB2312 or UTF-8 for Simplified Chinese, BIG5 or UTF-8 for Traditional Chinese. For Chinese users, changing the locale to Chinese is also a good idea.
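The “3F3F” result can be reproduced in a few lines of Java. This is a minimal sketch, assuming ISO-8859-1 stands in for the English ANSI code page; the class name `EncodeLossDemo` and the helper `toHex` are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeLossDemo {
    // Hex dump of the bytes produced by encoding text with the given charset.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4F60 U+597D ("ni hao"), written as escapes so the encoding
        // of this source file does not affect the demo
        String text = "\u4F60\u597D";

        // ISO-8859-1 cannot represent these characters: each becomes '?' (0x3F)
        System.out.println(toHex(text, StandardCharsets.ISO_8859_1)); // 3F3F

        // UTF-8 keeps the information: three bytes per character
        System.out.println(toHex(text, StandardCharsets.UTF_8)); // E4BDA0E5A5BD
    }
}
```

Once the bytes on disk are 3F 3F, nothing distinguishes them from two real question marks, which is why the fix has to happen at save time.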

  2. Caused by decoding

On a Chinese Windows system (Cwin), create a txt file containing “你好” and copy it to Ewin. Open it there and the text is garbled.

  • Reason:
    Cwin saves the txt file with ANSI, which on that system means GB2312. After the file is copied to Ewin, Notepad decodes it as ISO-8859-1.

  • Solution:
    Select the right decoding method (here GB2312) when opening the file.
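Unlike the encoding case, the bytes are still intact here, so picking the right decoder recovers the text. The sketch below illustrates that under two assumptions: the running JDK ships the GB2312 charset, and ISO-8859-1 stands in for what Notepad uses on Ewin. The class name `DecodeMismatchDemo` and the helper `reinterpret` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeMismatchDemo {
    // Encode text with one charset, then decode the same bytes with another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        String original = "\u4F60\u597D"; // U+4F60 U+597D

        // Wrong decoder: the four GB2312 bytes C4 E3 BA C3 show up as
        // four Latin-1 characters instead of two Chinese ones
        String garbled = reinterpret(original, gb2312, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 4

        // ISO-8859-1 maps every byte to a character, so nothing was lost:
        // turn the characters back into bytes and decode them as GB2312
        String repaired = reinterpret(garbled, StandardCharsets.ISO_8859_1, gb2312);
        System.out.println(repaired.equals(original)); // true
    }
}
```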

  3. Caused by the application

Open uedit32.exe (Chinese version) and its text is garbled.

  • Reason: Windows uses Unicode if the application supports Unicode; otherwise it uses ANSI, meaning the standard encoding defined for the configured country or region.

  • Solution: Edit the Regional and Language Options: set the standards and formats and the language for non-Unicode programs to Simplified Chinese. The system will then decode using the matching ANSI encoding.

  4. Caused by a missing font

Open a file and see square symbols.

  • Reason: Display goes from the binary byte sequence to a code point, then to the character's glyph, which is looked up in the font library and drawn as a dot matrix on the screen. If the glyph is not found, a square is shown in its place.

  • Solution: Install a font library that contains the characters.

Think in coding

  1. I/O operation: reading is decoding (byte -> character), while writing is encoding (character -> byte)

  2. Here is the java I/O interface:

When we use Writer and FileOutputStream:

  1. String.getBytes.

String.getBytes(): Encodes this String into a sequence of bytes using the platform’s default charset (Charset.defaultCharset(), which is determined by the system property file.encoding), storing the result in a new byte array.

Note: if you do not set the JVM’s file.encoding, it depends on the environment that starts the JVM: when started from cmd it follows the regional language setting, while Eclipse lets you set this property.
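To see which charset a given JVM actually picked up, it can help to print both values and compare the no-argument `getBytes()` against an explicit charset. A small sketch; the class name `DefaultCharsetDemo` and the helper `sameAsDefault` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // True when the no-argument getBytes() yields the same bytes as the given charset.
    static boolean sameAsDefault(String text, Charset charset) {
        return Arrays.equals(text.getBytes(), text.getBytes(charset));
    }

    public static void main(String[] args) {
        // Values inherited from the environment, or from -Dfile.encoding=...
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset  = " + Charset.defaultCharset());

        String text = "\u4F60\u597D"; // U+4F60 U+597D
        // The answers below change with the environment that launched the JVM
        System.out.println("same as UTF-8   = " + sameAsDefault(text, StandardCharsets.UTF_8));
        System.out.println("same as Latin-1 = " + sameAsDefault(text, StandardCharsets.ISO_8859_1));
    }
}
```

Running it once from cmd and once from an IDE, or with `java -Dfile.encoding=GBK DefaultCharsetDemo`, shows how the same call can produce different bytes. On JDK 18 and later the default is UTF-8 unless it is overridden.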

List[1]. String.getBytes() display messy code

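import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class MessyCodeDemo {
    private static final String fileName = "c:\\log.txt";

    public static void main(String[] args) {
        String str = "你好,中国";
        writeError(str);
    }

    private static void writeError(String a_error) {
        File logFile = new File(fileName);
        // Create the byte stream; try-with-resources closes it afterwards
        try (FileOutputStream outPutStream = new FileOutputStream(logFile, true)) {
            // Encode the string into bytes using the platform's default charset,
            // so the bytes written depend on where the JVM was started
            byte[] bytes = a_error.getBytes();
            // Use the byte count here; a_error.length() counts chars and would truncate the output
            outPutStream.write(bytes, 0, bytes.length);
            outPutStream.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}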

List[2]. OutputStreamWriter with an explicit charset

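// Belongs in the same class as the previous listing; also needs
// java.io.BufferedWriter, java.io.OutputStreamWriter and java.io.Writer
private static void writeErrorWithCharSet(String a_error) {
    File logFile = new File(fileName);
    String charsetName = "utf-8";
    // Convert characters to bytes using the Unicode character set, encoded as UTF-8;
    // try-with-resources closes the writer even if write() fails
    try (Writer m_write = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(logFile, true), charsetName))) {
        m_write.write(a_error);
    } catch (IOException e) {
        e.printStackTrace();
    }
}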

To avoid messy code, when calling an I/O API you had better use the overload that takes an explicit charset argument.
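The same rule applies on the reading side: the charset given to the Reader has to match the one the Writer used. The sketch below uses in-memory streams so it runs anywhere; the class name `ExplicitCharsetIoDemo` and the helpers `write` and `read` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetIoDemo {
    // Write = encode: characters become bytes in the given charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Read = decode: bytes become characters in the given charset.
    static String read(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4F60\u597D"; // U+4F60 U+597D
        byte[] data = write(text, StandardCharsets.UTF_8);

        // Matching charsets round-trip the text
        System.out.println(read(data, StandardCharsets.UTF_8).equals(text)); // true
        // A mismatched reader turns the six UTF-8 bytes into six wrong characters
        System.out.println(read(data, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```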

Web Application

Reason:

  1. The browser does not follow the URI encoding standard, the server's encoding and decoding are not configured, or the developer made a mistake.

  2. GET method: non-ASCII characters are encoded with URL encoding.

domain:port/contextPath/servletPath/pathInfo?queryString

How pathInfo and queryString are decoded depends on the server. Tomcat configures both in server.xml: the charset used to decode the pathInfo part is defined on the Connector (in Tomcat this is the URIEncoding attribute), and the queryString is controlled by useBodyEncodingForURI (if it is not set, Tomcat 8.0 and later use UTF-8).

To avoid unwanted encoding, it is best to use only ASCII in the URL (or URL-encode first).
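URL-encoding first might look like the sketch below, which also shows what happens when the decoding side assumes a different charset. The path `/search?name=` and the names `QueryStringDemo`, `encode`, and `decode` are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryStringDemo {
    // Percent-encode a value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decode a percent-encoded value using the named charset.
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u4F60\u597D"; // U+4F60 U+597D
        String encoded = encode(name, "UTF-8");

        // Only ASCII goes on the wire
        System.out.println("/search?name=" + encoded); // /search?name=%E4%BD%A0%E5%A5%BD

        // The server has to decode with the same charset the client encoded with
        System.out.println(decode(encoded, "UTF-8").equals(name));      // true
        System.out.println(decode(encoded, "ISO-8859-1").equals(name)); // false
    }
}
```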

  3. POST method: the browser checks the contentType (“text/html;charset=utf-8”) and uses it to encode the form.

<%@ page language="java" contentType="text/html; charset=GB18030" pageEncoding="UTF-8"%>

pageEncoding is the encoding in which the JSP file itself is saved.

List[3]. POST request that calls setContentType

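// "uploader" is assumed to be a ServletFileUpload field (Apache Commons FileUpload) initialized elsewhere
protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    if (!ServletFileUpload.isMultipartContent(request)) {
        throw new ServletException("Content type is not multipart/form-data");
    }
    response.setCharacterEncoding("UTF-8"); // set the response encoding
    response.setContentType("text/html;charset=UTF-8");
    PrintWriter out = response.getWriter();
    out.write("<html><head></head><body>");
    try {
        List<FileItem> items = (List<FileItem>) uploader.parseRequest(request);
        // ... handling of the uploaded items is not shown in the source ...
    } catch (FileUploadException e) {
        throw new ServletException(e);
    }
    out.write("</body></html>");
}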

JSP that sends the request with the POST method

<%@ page language="java" contentType="text/html; charset=utf-8" pageEncoding="utf-8"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>index</title>
  <meta http-equiv="pragma" content="no-cache">
  <meta http-equiv="cache-control" content="no-cache">
  <meta http-equiv="expires" content="0">
</head>
<body>
  <form action="FileUploadServlet" method="post" enctype="multipart/form-data">
    选择上传文件:<input type="file" name="fileName">
    <br>
    <input type="submit" value="上传">
  </form>
</body>
</html>

- Browser display: Chrome uses the JSP contentType and charset, while Firefox uses its text encoding setting.

- For JSP (HTML): the JSP file is saved using pageEncoding; if that is not specified, charset is used; if there is no charset either, the default is ISO-8859-1. The charset is responsible for telling the browser how to decode the web page.

- For dynamic content: the server uses HttpServletResponse.setContentType to set the contentType in the HTTP header.

File name is garbled when downloading

Reason: HTTP headers only support the ASCII character set, so other characters are encoded as 3F (?).

Solution: call URLEncoder.encode(filename, charset) first, then put the result in the header.

list[4]

// getDecodeParameter is a helper from the source article (not shown) that returns the decoded request parameter
protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    String fileName = getDecodeParameter(request, "fileName");
    String userName = getDecodeParameter(request, "username");
    // Headers only carry ASCII, so URL-encode the non-ASCII values first
    response.setHeader("Content-Disposition", "attachment; filename=\""
            + URLEncoder.encode(fileName, "utf-8") + "\";userName=\""
            + URLEncoder.encode(userName, "utf-8") + "\"");
}
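One detail to watch when using URLEncoder for a file name: it implements form encoding, so a space becomes "+", which many browsers would keep literally in the saved name. The sketch below builds the header value without a servlet container and swaps "+" for "%20"; that replacement is my addition, not part of the listing above. The names `DownloadHeaderDemo` and `contentDisposition` are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DownloadHeaderDemo {
    // Build an ASCII-only Content-Disposition value for a possibly non-ASCII file name.
    static String contentDisposition(String fileName) {
        String encoded;
        try {
            encoded = URLEncoder.encode(fileName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
        // A literal '+' in the input is already %2B at this point,
        // so any remaining '+' stands for a space
        encoded = encoded.replace("+", "%20");
        return "attachment; filename=\"" + encoded + "\"";
    }

    public static void main(String[] args) {
        // U+4F60 U+597D followed by " report.txt"
        System.out.println(contentDisposition("\u4F60\u597D report.txt"));
        // attachment; filename="%E4%BD%A0%E5%A5%BD%20report.txt"
    }
}
```

In a servlet the result would be passed to `response.setHeader("Content-Disposition", ...)`. RFC 6266 also defines a `filename*` parameter for non-ASCII names, which may be worth checking for the browsers you need to support.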

Database operation

Bridge: Unicode

Database server, client system, client environment variable.

Create the database using UTF-8; SQL NCHAR can also solve multi-language issues.

Analyzing the web request in depth

Refer to the RFCs

Analyzing Java Chinese encoding in depth

Unicode encoding standard

  • Post title:Messy code issue
  • Post author:ReZero
  • Create time:2018-01-23 16:55:51
  • Post link:https://rezeros.github.io/2018/01/23/messy-code/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.