Rust and Python Packages

Bringing Rust libraries to Python programs

Rust
Python

Encoding Detection

When travelling across the web and gathering content, you may have encountered the following HTML tag:

<html>
  <head>
    ...
    <meta charset="UTF-8">
  </head>
</html>

This element declares the page's character encoding, so you can convert its bytes into readable text. In some cases, though, it isn't present, and you need a character detection library to infer the encoding from the raw bytes; this comes up often with older web content.

In Python, let's say you have a script processing data with an unknown encoding. You could reach for a character detection library like chardet or charset-normalizer, and that would solve your problem with ease. In the case of charset-normalizer, it may be all that you need to get going.
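For example, with chardet (a quick sketch, assuming chardet is installed; the sample bytes here are just for illustration):

import chardet

# Bytes scraped from a page with no declared encoding.
raw = "ハローワールド".encode("shift_jis")

# detect() returns a dict with the guessed encoding and a confidence score.
guess = chardet.detect(raw)
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")

charset-normalizer ships a chardet-compatible detect() as well, so either library drops in with a few lines.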

We didn't come here to read about pure Python libraries though, did we? How do our web browsers do this work, often in C/C++? How much faster can we go if we break out of Python? And can we improve accuracy by relying on deeper, more battle-tested libraries?

Firefox uses chardetng, a Rust-based detector for legacy content encodings, which offers impressive accuracy and lofty goals for effective detection. You can read more about the motivations in Henri Sivonen's blog entry introducing the library.

How do we bridge to a Rust library and use it from Python code? The go-to project for this is PyO3, which maintains bindings between the Python ABI and Rust. This lets us as developers tap into Python with Rust-based primitives, and all kinds of projects sweeping the Python ecosystem now use Rust to speed up hot code paths where pure Python performs poorly.

So, we can use maturin, a build tool for Rust-based Python packages, to bootstrap a project.
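A minimal bootstrap might look like this (a sketch, assuming a recent maturin and an active virtualenv; the project name here is just for illustration):

python -m pip install maturin
maturin new --bindings pyo3 rs_chardet_demo
cd rs_chardet_demo
maturin develop  # builds the Rust crate and installs it into the active venv

If we can then take a vector of bytes from our Python program and use a function like so: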

from rs_chardet import detect_encoding
guessed_codec = detect_encoding("ハローワールド\n".encode("shift_jis"))

where detect_encoding would have a signature like:

from codecs import CodecInfo
from typing import Optional
def detect_encoding(b: bytes) -> Optional[CodecInfo]:
    ...

Then we get all the benefits of Firefox's built-in charset detector plus native compilation, and we can replace our pure Python dependency wholesale. In fact, you can see this tool, along with some benchmarks, at rs_chardet. I've also published it with the PyPA team's new trusted publishing GitHub Action, so we've got a securely published, source-available package that uses Firefox's Rust-based character detection.
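The publish workflow is short; here's a hedged sketch, assuming a trusted publisher is already configured for the project on PyPI and the wheel is built with maturin:

name: publish
on:
  release:
    types: [published]
jobs:
  pypi:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # OIDC token for trusted publishing; no API token needed
    steps:
      - uses: actions/checkout@v4
      - run: pipx install maturin && maturin build --release --out dist
      - uses: pypa/gh-action-pypi-publish@release/v1  # uploads dist/ by default

With the package on PyPI, all you have to do is: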

python -m venv venv
. venv/bin/activate
python -m pip install rs_chardet
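As a quick end-to-end sketch (assuming only the detect_encoding signature shown above; CodecInfo carries the codec's registered name, which bytes.decode accepts):

from rs_chardet import detect_encoding

raw = "ハローワールド\n".encode("shift_jis")
codec = detect_encoding(raw)
# Fall back to UTF-8 with replacement if detection comes up empty.
text = raw.decode(codec.name) if codec else raw.decode("utf-8", errors="replace")
print(text)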

And you are all set! Feel free to collaborate further over on GitHub if you have any questions about the library or its API/design.