[Code] file_found?(path) and to_ascii+to_unicode.rb

42 Posts 5 Posters 5.9k Views 5 Watching
    thomthom
    last edited by 7 Jul 2009, 06:38

    Sorry, UTF-8 uses one-four bytes...

    Thomas Thomassen - SketchUp Monkey & Coding addict
    List of my plugins and link to the CookieWare fund

      TIG Moderator
      last edited by 7 Jul 2009, 07:43

      This is an updated version. Note that if you have code containing an earlier version, then to_ascii() becomes to_unicode() and vice versa, so you will need to make adjustments...

      TIG (c) 2009
      Ruby Code: to_ascii+to_unicode.rb
      Usage: text = to_unicode(txt_ascii)
      text = to_ascii(txt_unicode)
      Returns text either converted to Unicode or ASCII characters...
      Why ?
      SUp returns model.path, txt=UI.openpanel("?","c:\\","*.txt") etc
      in Unicode characters, whereas the Ruby system returns things in ASCII:
      so e.g. FileTest.exist?(txt) wrongly returns 'false' - it fails to see
      the match although the file really exists.
      However,
      txt = UI.openpanel("?","c:\\","*.txt")
      FileTest.exist?(to_ascii(txt))
      correctly returns 'true', as does...
      FileTest.exist?(to_ascii(Sketchup.active_model.path))

      (Note you no longer need to use the limited and clunky
      'file_found?(path)' ruby)

      If for some reason you have ASCII text it can be made into Unicode
      using the other form...
      text = to_unicode(txt)...

      v1.1 20090707 Names swapped round to reflect correct usage - [thanks to thomthom]

      Put in Plugins folder...

      TIG

        TIG Moderator
        last edited by 7 Jul 2009, 07:44

        thomthom

        I at last see your explanation now... It's Ruby that's wrong NOT SUp... The more modern coding of SUp means that it returns text that is out of step with Ruby, which is a bit old and clunky when it comes to Unicode ?
        ☀ 😳

        Whatever it's called it still works !!! ... Perhaps we [I] should have given it another 'neutral' name - however, I have simply swapped over the names and issued an updated version here... http://forums.sketchucation.com/viewtopic.php?p=169274#p169274

        TIG

          thomthom
          last edited by 7 Jul 2009, 08:09

          @tig said:

          I at last see your explanation now... It's Ruby that's wrong NOT SUp... The more modern coding of SUp means that it returns text that is out of step with Ruby, which is a bit old and clunky when it comes to Unicode ?

          Yes. You can see that if you type "å".length in the Ruby console. The console also uses Unicode (I think SU uses UTF-8 encoding). The return value is 2, which shows that Ruby doesn't deal with multi-byte characters.
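A minimal sketch of that check, written so it does not depend on the console or the source file encoding: the "å" is spelled out as its two UTF-8 bytes. Counting with unpack('C*') gives the byte count that Ruby 1.8's .length reports, while unpack('U*') decodes UTF-8 and gives the number of characters. (On Ruby 1.9 and later, .length itself returns 1 for this string, so the byte count is taken explicitly here.)

```ruby
# "å" written as its two UTF-8 bytes (0xC3 0xA5).
a_ring = "\xC3\xA5"

byte_count = a_ring.unpack('C*').length   # how many bytes the string holds
char_count = a_ring.unpack('U*').length   # how many UTF-8 code points it holds

puts "bytes: #{byte_count}, characters: #{char_count}"
# prints "bytes: 2, characters: 1"
```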

          I had the very same problem with PHP when I was making websites. It also assumes 8bit per character. I thought PHP was the only dinosaur and I'm very surprised that this new Ruby language doesn't deal with Unicode strings.

          @tig said:

          Whatever it's called it still works !!! ... Perhaps we [I] should have given it another 'neutral' name - however, I have simply swapped over the names and issued an updated version here... viewtopic.php?p=169274#p169274

          It might be good to find names that relate better to what is actually returned. It's not really ASCII that Ruby uses either - as mentioned, ASCII only uses the first 128 values of a byte. It might be ANSI - which extends ASCII to the full 256. But it might be an ISO-8859-x encoding (in which case it's most likely ISO-8859-1 Latin-1 Western European).
          When I look at Notepad++ and Notepad they are both set to encode in ANSI - so my current best guess is ANSI.

          I'll have to have a closer look at what the methods actually do to determine what they return.

          Character encodings are a nightmare. And with SU being UTF-8 and Ruby ANSI(?) - this is just begging for problems. I think there are some UTF-8 libraries which we can use without breaking existing code. Some scripts might rely on Ruby treating multi-byte characters as multiple characters, so we can't really modify the existing methods.
          But maybe some of the libraries offer some good conversion tools. Maybe a custom Unicode string class which people can use if they need to deal with Unicode characters and string manipulations. Converting from Unicode to ANSI and back is so prone to errors. (I spent ages struggling with this when I was making a parser in PHP for my UTF-8 encoded website.)

          I've been meaning to write up a gotcha-thread for ruby scripting - character encoding is one of the important points I need to include.

          Considering the widespread usage of SU in various languages I'm surprised this problem hasn't been mentioned more in the forums.

            thomthom
            last edited by 7 Jul 2009, 09:36

            Ok. I'm pretty sure that Ruby uses ANSI. "ANSI" isn't really the right name either (character encoding is a topic that will hurt your brain); it is actually Windows-1252, which is not really an ANSI standard. Windows-1252 is a superset of ISO-8859-1.

            UTF-8, which SU uses (I'm pretty sure of this: I was reverse engineering the .skp format and that used a mix of ANSI and UTF-8), is backwards compatible with ASCII - all ASCII characters (the first 128 of UTF-8) are mapped the same in UTF-8. Characters outside the 128-character ASCII set are mapped with at least two bytes.
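To make the byte counts concrete, here is a small sketch that encodes four code points as UTF-8 with pack('U') and counts the resulting bytes. The four code points are picked purely for illustration: "A" (ASCII), "å", the Euro sign, and an emoji.

```ruby
# Code points chosen for illustration: "A", "å", the Euro sign, an emoji.
code_points = [0x41, 0xE5, 0x20AC, 0x1F600]

utf8_lengths = code_points.map do |code_point|
  # pack('U') encodes one code point as UTF-8; unpack('C*') lists its bytes.
  [code_point].pack('U').unpack('C*').length
end

p utf8_lengths
# prints [1, 2, 3, 4]
```

Only the ASCII character keeps its single byte; everything above code point 127 takes two to four bytes.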

            However, Ruby uses Windows-1252 (ANSI), which also maps the ASCII set to its first 128 characters, but it has extra characters for its remaining 128. This means that for Western European languages we can map UTF-8 strings back to ANSI with fair success. This is what TIG's two methods do.

            I don't know what happens if you try to map UTF-8 characters that don't exist in the Windows-1252 set.
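One way to sidestep that question is to test a string before converting it. The sketch below uses a hypothetical helper, fits_single_byte? (not part of TIG's script), that checks whether every code point is 255 or lower, which is what pack('C*') needs in order to store each character in one byte. It is not a true Windows-1252 membership test, only a guard against values that cannot fit in a byte.

```ruby
# Hypothetical helper: true when every code point in a UTF-8 string is 255 or
# lower, i.e. when pack('C*') can store each character in a single byte.
def fits_single_byte?(str_utf8)
  str_utf8.unpack('U*').all? { |code_point| code_point <= 0xFF }
end

norwegian = "bl\xC3\xA5"     # "blå" as UTF-8 bytes
euro      = "\xE2\x82\xAC"   # the Euro sign (U+20AC) as UTF-8 bytes

puts fits_single_byte?(norwegian)   # prints true
puts fits_single_byte?(euro)        # prints false
```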

            TIG: my suggestion for the method names:
            .to_ansi
            .to_utf8

            This will correctly describe what they do. Unicode can be many things: it could have more bytes per character, in a different byte order, so utf8 will give the correct assumption of what will happen.

              thomthom
              last edited by 7 Jul 2009, 10:28

              Ok, digging deeper into this, Ruby 1.8 doesn't have an encoding type at all. It simply treats String as a series of bytes (8-bits) http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18

              Since ASCII, Windows-1252 and ISO-8859-? all work within the 8-bit range, they can be handled in Ruby without further processing of the string.

                thomthom
                last edited by 7 Jul 2009, 10:37

                
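                # ! Could lead to a corrupt string if the UTF-8 string can't fit into an 8-bit encoding.
                # Decodes the UTF-8 code points, then writes each one out as a single byte.
                def to_ansi(str_utf8)
                  return str_utf8.unpack('U*').pack('C*')
                end

                # Safe to use.
                # Reads each byte as a code point, then encodes the code points as UTF-8.
                def to_utf8(str_ansi)
                  return str_ansi.unpack('C*').pack('U*')
                end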
                
                

                  thomthom
                  last edited by 7 Jul 2009, 10:40

                  http://blog.grayproductions.net/categories/character_encodings <- looks like some interesting reading.

                    thomthom
                    last edited by 7 Jul 2009, 11:07

                    Not sure if .to_ansi is the correct name any more, since Ruby has no encoding. Maybe it's a coincidence that it works on my system (Windows), which uses the Windows-1252 encoding. I expect that return str_utf8.unpack('U*').pack('C*') returns correct ASCII for the first 128-byte set, but it's the rest that has me puzzled. On my system it maps fine to ANSI, but maybe a different system might behave differently when it comes to the accented characters....

                      thomthom
                      last edited by 7 Jul 2009, 11:16

                      Source: redmine.ruby-lang.org (the link now returns a Redmine 404 error)

                      @unknownuser said:

                      I noticed issues with other things, like puts, print and such.

                      **Most of the File and IO functions for Windows are ANSI, not Wide, which limits the options to process properly paths, filenames and even output of strings using UTF/Unicode characters.

                      Also, the console page affects ruby. By default is 437, but 1252 is needed to get accented strings to work.**

                      Further review of the used Windows API is needed to find these issues.

                      @unknownuser said:

                      There are no plan to resolve the original problem on 1.8.
                      You must pass the path with Win32 file API's encoding to ruby.

                      I know it's VERY inconvenient for users in Europe, but we cannot break compatibility of commandline/path handling in 1.8 branch.

                      This is along the lines of what I thought. The File / IO classes under Windows appear to demand ANSI (1252) to operate. Question is: what happens on Mac systems? I need to poke around on my Mac when I get home.

                        thomthom
                        last edited by 7 Jul 2009, 11:20

                        This is also an interesting comment:

                        @unknownuser said:

                        ...
                        This method applies to output routines only, and it is useless here.
                        Actually problem is in the Windows-specific implementation of some Ruby
                        libraries. Ruby reads the environment variable in awful wrong encoding, and
                        works with file system objects in awful wrong encoding, too.
                        ...

                          thomthom
                          last edited by 7 Jul 2009, 11:32

                          Ok, I know I'm going back and forth here, but at the moment .to_ansi might be wrong. .to_ascii might not be a complete indication of what's returned, but we know that we get the ASCII characters. .to_single_byte_characters or .to_unsigned_chars would be more accurate, but those names are a bit long.

                            TIG Moderator
                            last edited by 7 Jul 2009, 11:37

                            Why don't we call the script "friendlytext.rb" and call them
                            ruby_friendly(text)
                            and
                            sup_friendly(text)

                            That way
                            FileTest.exist?(ruby_friendly(model.path))
                            returns true

                            and you can turn Ruby made text back to suit SUp using the other form ? ...

                            This also doesn't enter into this ansi/ascii/unicode/utf8 territory which looks like a quicksand... It also doesn't apportion blame !!!

                            thomthom, you seem to be spending more time on this than me... why don't you take it over and decide what to call the methods ? I'd be happy to hand it over...

                            TIG

                              thomthom
                              last edited by 7 Jul 2009, 11:49

                              @tig said:

                              Why don't we call the script "friendlytext.rb" and call them
                              ruby_friendly(text)
                              and
                              sup_friendly(text)

                              That's a very pragmatic solution to it. I like it!

                              @tig said:

                              thomthom, you seem to be spending more time on this than me... why don't you take it over and decide what to call the methods ?

                              Since I use Norwegian characters that fall into this encoding trap it's rather important to me to know what's going on. I might look into some extra set of helper functions that I'll add to the SKX project. But I'd need some time to work out what's really going on.
                              For now these snippets will provide enough functionality for most Western languages.

                                TIG Moderator
                                last edited by 7 Jul 2009, 12:58

                                thomthom

                                She's all yours...

                                If you want me to remove any early stuff let me know [PM etc]...

                                TIG

                                  thomthom
                                  last edited by 7 Jul 2009, 13:15

                                  I think it can stick with the friendly names you suggested.
                                  I'm doing more research. Just signed up for a Ruby forum to work out how Ruby behaves. Once I've gathered the info I need I'll make a thread describing the findings.

                                    thomthom
                                    last edited by 7 Jul 2009, 13:16

                                    FYI, for anyone that (most unlikely) might be following on my ramblings - I've initiated a new thread over at a Ruby forum for further investigations: http://www.ruby-forum.com/topic/191016#833043

                                      thomthom
                                      last edited by 7 Jul 2009, 17:52

                                      I'm fairly confident that it's ISO 8859-1 which .pack('C*') generates. As long as the characters fit into ISO 8859-1, the UTF-8 can be converted with the pack/unpack methods. If they fall outside, other solutions are needed.
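The reason this works can be seen by looking at the numbers: the first 256 Unicode code points are the same values as the ISO 8859-1 bytes, so writing each code point out as a single byte gives ISO 8859-1 text. A small sketch with three Norwegian letters, spelled out as UTF-8 bytes:

```ruby
utf8 = "\xC3\xA5\xC3\xB8\xC3\xA6"       # "åøæ" as UTF-8 (six bytes)

code_points  = utf8.unpack('U*')        # decode UTF-8 into code points
single_bytes = code_points.pack('C*')   # write each code point as one byte

p code_points                  # prints [229, 248, 230]
p single_bytes.unpack('C*')    # prints [229, 248, 230]
```

229, 248 and 230 (0xE5, 0xF8, 0xE6) are the ISO 8859-1 byte values for å, ø and æ.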

                                      One annoying finding with this is that the Euro symbol seems to be impossible to use. When Ruby comes across this in a UTF-8 string it chokes and throws an error.

                                      I'll begin writing this up into something more readable than today's ramblings.

                                        thomthom
                                        last edited by 7 Jul 2009, 18:15

                                        sigh
                                        Euro can be used - just not typed into the Console. You might get it passed via SU's Ruby methods if it's used in a material name or component name, though .unpack('U') will return 8364 for it, way outside the code points of ISO 8859-1. However, if you type in the octal value "\200" you get the Euro in a 1-byte string, so it should be mappable. But pack and unpack don't map the Unicode points between 128-160 well, so I'll be looking for a better conversion.
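One possible shape for such a conversion, as a sketch rather than a finished solution: look up the handful of characters that Windows-1252 places in the 128-159 byte range, pass through everything that is identical in both sets, and substitute a placeholder for the rest. The helper name utf8_to_cp1252, the "?" fallback and the constant name are assumptions made for this example, and the table is deliberately partial.

```ruby
# Partial table: Unicode code point => Windows-1252 byte.
# Only some of the entries in the 0x80-0x9F range are listed.
CP1252_EXTRAS = {
  0x20AC => 0x80,  # Euro sign (octal "\200")
  0x2026 => 0x85,  # horizontal ellipsis
  0x2018 => 0x91,  # left single quotation mark
  0x2019 => 0x92,  # right single quotation mark
  0x201C => 0x93,  # left double quotation mark
  0x201D => 0x94,  # right double quotation mark
  0x2013 => 0x96,  # en dash
  0x2014 => 0x97,  # em dash
  0x2122 => 0x99   # trade mark sign
}

# Hypothetical helper: converts UTF-8 text to single-byte Windows-1252 text.
# Characters it cannot map become the fallback byte ("?" by default).
def utf8_to_cp1252(str_utf8, fallback = 0x3F)
  bytes = str_utf8.unpack('U*').map do |code_point|
    if CP1252_EXTRAS.key?(code_point)
      CP1252_EXTRAS[code_point]
    elsif code_point < 0x80 || (code_point >= 0xA0 && code_point <= 0xFF)
      code_point   # same value in ASCII / ISO 8859-1 / Windows-1252
    else
      fallback
    end
  end
  bytes.pack('C*')
end

price = utf8_to_cp1252("\xE2\x82\xAC 5")   # Euro sign, space, "5"
p price.unpack('C*')
# prints [128, 32, 53]
```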

                                        I've heard of Iconv, but that's a Win API call I think - not a solution for Mac.

                                          TIG Moderator
                                          last edited by 7 Jul 2009, 19:08

                                          I am glad you have taken such a complicated thing over .... 😄

                                          TIG
